Debiased inference for heterogeneous subpopulations in a high-dimensional logistic regression model

Due to the prevalence of complex data, data heterogeneity is often observed in contemporary scientific studies and various applications. Motivated by studies on cancer cell lines, we consider the analysis of heterogeneous subpopulations with binary responses and high-dimensional covariates. In many practical scenarios, it is common to use a single regression model for the entire data set. To do this effectively, it is critical to quantify the heterogeneity of the effect of covariates across subpopulations through appropriate statistical inference. However, the high dimensionality and discrete nature of the data can lead to challenges in inference. Therefore, we propose a novel statistical inference method for a high-dimensional logistic regression model that accounts for heterogeneous subpopulations. Our primary goal is to investigate heterogeneity across subpopulations by testing the equivalence of the effect of a covariate and the significance of the overall effects of a covariate. To achieve overall sparsity of the coefficients and their fusions across subpopulations, we employ a fused group Lasso penalization method. In addition, we develop a statistical inference method that incorporates bias correction of the proposed penalized method. To address computational issues due to the nonlinear log-likelihood and the fused Lasso penalty, we propose a computationally efficient and fast algorithm by adapting the ideas of the proximal gradient method and the alternating direction method of multipliers (ADMM) to our settings. Furthermore, we develop non-asymptotic analyses for the proposed fused group Lasso and prove that the debiased test statistics admit chi-squared approximations even in the presence of high-dimensional variables. In simulations, the proposed test outperforms existing methods. The practical effectiveness of the proposed method is demonstrated by analyzing data from the Cancer Cell Line Encyclopedia (CCLE).


S1 Proofs
Recall that the row-wise support set is S " tj P t1, . . ., pu : β pjq ‰ 0u and the indices sets of different coefficient values on the same covariates is Ω " tpj, g, g 1 q : β pgq j ‰ β pg 1 q j u.Let w min " min jPS c w j , v min " min pj,g,g 1 qPΩ c v jgg 1 wmax " max jPS w j , ṽmax " max pj,g,g 1 qPΩ v jgg 1 .
For a matrix ∆ P R pˆG , let ∇L n p∆q :" ∇L n pvecp∆qq P R pG .For a pG-dimensional vector δ " rδ p1q , . . ., δ pGq s J consisting of p-dimensional vectors δ p1q , . . ., δ pGq , let matrδs " rδ p1q , . . ., δ pGq s P R pˆG .For a matrix ∆ and index set of columns, say C, let ∆ ¨,C denote sub-matrix of ∆ with columns in the set C. Lemma 1.Let A be a pGpG ´1q{2 ˆpG matrix consisting of an identity matrix I p and ´Ip satisfying A J A " G ˆIpG ´E b I p , where E P R GˆG is a matrix with all ones, i.e., A " ¨Ip ´Ip 0 pˆp ¨¨¨0 pˆp I p 0 pˆp ´Ip ¨¨¨0 pˆp . . .¨¨¨¨¨¨¨¨¨. . .
Proof.For 1 ď j ď p and 1 ď g ď G, we have matr∇L n pBqs jg " where ϵ pgq i :" Ery pgq i s ´ypgq i .In a matrix form, matr∇L n pBqs ¨,g " 1 n rX pgq s J E pgq , where E pgq :" pϵ pgq 1 , . . ., ϵ pgq n q J .We have that for some t ą 0, where the second equation holds due to a Taylor series expansion, the first inequality follows from the fact that pϵ pgq i q 2 ď 1, and the second inequality follows from the fact that 1 `x ď exppxq for any x.Thus, ϵ pgq i x pgq ij is sub-Gaussian with variance proxy 4px pgq ij q 2 .Hence, by the Hoeffding bound (Proposition 2.5 of Wainwright [1]), we obtain By the uniform bound, we obtain where the last inequality follows from n g ď n.Hence, with probability at least 1´1{ppGq, we have }∇L n pBq} max :" max Because each row of A has only one 1 and one ´1, respectively, we have }A∇L n pBq} max ď 2}∇L n pBq} max ď 8 a logppGq{n.This completes the proof.
Hence, we have by Lemma 1 and matr∇L n pBqs " r∇ℓ 1 pβ p1q q, . . ., ∇ℓ G pβ pGq qs, it holds that with probability at least 1 ´1{ppGq, where the second inequality follows from the conditions of λ1 and λ2 in Theorem 1, and the third inequality follows from the triangle inequality and the fact that Then, (S2) with λ2 " G ´3{2 λ1 gives the following inequality: Using the restricted strong convexity presented in Condition 2, we obtain where the third inequality follows from the fact that ab ď a 2 `b2 for any numbers a and b and the fact that }Aδ} 2 ď 2}δ} 2 .This gives Combining with the definitions of λ1 and λ2 , this completes the proof.

Proof of Theorem 2
Note that the rate conditions of λ 1 and λ 2 in Theorem 2, Condition 4 and Theorem 1 imply Suppose that Bj,¨‰ 0 for some j P S c .Denote : B by : B l,¨" Bl,¨f or l ‰ j and : B j,¨" 0 G .Because L n p¨q is convex, we have B ´Bq ˘dt, where : Q is a pG ˆpG block-diagonal matrix, consisting of G different p ˆp sub-matrices : Q pgq .Specifically, we can write : Q :" diagp : Q p1q , . . ., : Q pGq q for some βpgq , where Let δ pgq :" : B ¨,g ´B¨,g P R p .Because δ pgq l " 0 for l ‰ j, we have where the last equality follows from the fact that 0 ď e pgq i ď 1{4 and Hence, by Lemma  `λ2 ṽmax s1{2 ": where the third inequality follows from Lemma 2.
Combining with (S4), we obtain Further, by the definition of : B, we have where the last inequality follows from : B j,¨" 0 G and : B l,¨" Bl,¨f or l ‰ j.Because λ 1 w min ą Ipλ 1 , λ 2 q by (S3), (S5)-(S6) imply that the objective value of : B is less than that of B, which contradicts to the fact that B is the minimizer.Thus, we have Bj,¨" 0 G for j P S c . (S7) Next, for some j P S and d ď G, define the set A :" tg 1 , . . ., g d u Ď t1, . . ., Gu such that B j,g 1 " . . ." B j,g d and B j,g k ‰ B j,l for l R A. Suppose that Bj,g k ‰ Bj,g l for some g k , g l P A. Denote B " p βjg q 1ďjďp,1ďgďG by Bj ,¨" Bj ,¨f or j ‰ j, and βjg " ř d l"1 βj,g l {d for g P A and βjg 1 " βjg 1 for g 1 R A. Because L n p¨q is convex, we have where Q is a pG ˆpG block-diagonal matrix, consisting of G different p ˆp sub-matrices Qpgq .Specifically, we can write Q :" diagp Qp1q , . . ., QpGq q for some βpgq , where Qpgq :" ": where the last inequality follows from Lemma 2. Hence, we have Because Bj ,¨" Bj ,¨f or j ‰ j and Further, by simple algebra, we have Thus, it holds that Then, for g R A and g 1 P A, we have Combining these inequalities, we obtain where the second inequality follows from 2 and the last inequality follows from (S3).Combining with (S8), this implies that the objective value of B is less than that of B, which contradicts to the fact that B is the minimizer.Thus, βj,g 1 " . . ." βj,g d .Combining with (S7), and the minimum signal and minimum signal difference conditions in Condition 4, we obtain that B is the oracle estimator, i.e., P p Ŝ " Sq Ñ 1 and P p Ω " Ωq Ñ 1.Thus, B " Bora , where Bora :"

"
p Bora S q J , 0 G,p´s ı J .This completes the proof.
For any y P t0, 1u and ν P R, recall that ℓpy, νq " ´yν `logp1 `exppνqq.For the loss function ℓpy, νq, let 9 ℓpy, νq and : ℓpy, νq denote its first and second derivatives with respect to ν, respectively.We obtain that the optimization (7) in the main paper is feasible.Specifically, rpΣ pgq q ´1s j,¨i s a feasible solution.

Proof of Theorem 3
Recall that the debiased estimator for β pgq is bpgq :" βpgq ´M pgq n g ng ÿ i"1 where ãgi is a point between px pgq i q J βpgq and px pgq i q J β pgq , i.e., |ã gi ´px Then, we have Then, from (S9 where the equality in the third line holds due to Condition 5 and the fact that as shown in the proof of Theorem 3.2 in Van de Geer et al. [2], where the last equality follows from Conditions 1 and 5.This completes the proof.
Recall that V pgq " M pgq Σpgq p M pgq q J and Vpjq be the G ˆG diagonal matrix with diagonal elements t 1 ng V pgq jj u G g"1 .Define S j :" V ´1{2 pjq p bpjq ´βpjq q " rS j1 , . . ., S jG s J .Below we only show the proofs of Theorem 4 and Corollary 1. Proofs of Theorem 5 and Corollary 2 are the same in principle.

Proof of Theorem 4 and Corollary1
By Theorem 3, we have for 1 ď g ď G, S jg " r V pgq jj s ´1{2 ?n g p bpgq Thus, we can only consider the first part of S jg in the distribution.We can write S j " ř maxg ng i"1 Z i , where Z i 's are independent, ErZ i s " 0 G , and ř maxg ng i"1 varpZ i q " I G .By the Berry-Essen Theorem for sums of independent random vectors [3], for all measurable convex sets A Ď R G .Note that we have for some absolute constant where the last inequality holds due to the fact that Hence, we have Define the set A by Then, the set A is a measurable convex set, thus we have that for any t, where χ 2 G is Chi-squared distributed with a degree of freedom G. Thus, we obtain that for any t, This completes the proof.

S2 Additional Lemmas
The following Lemma 2 presents an estimation error bound for B. Note that in Theorem wmax ^ṽ max ą 1{p2 Cq.
Combined with the rate conditions of λ 1 and λ 2 in Theorem 2, we also have w min ^vmin ą wmax _ ṽmax , a 16Gplog p `log Gq{n ă λ 1 wmax , and a 64plog p `log Gq{pnG 2 q ă λ 2 ṽmax .For a minimizer B " p βjg q 1ďjďp,1ďgďG of the original optimization (5) in the main paper, let β :" vecp Bq P R pG , ∆ :" B ´B P R pˆG , and δ :" vecp∆q " β ´β.Let δ pgq P R p be the gth vector in δ, i.e., δ " rpδ p1q q J , . . ., pδ pGq q J s J P R pG .We have Combining this and the same arguments in the proof of Theorem 1, it holds that with probability at least 1 ´1{ppGq, where the second inequality follows from the conditions of λ 1 and λ 2 , and the last inequality follows from w min ě wmax .Because w min ^vmin ą wmax _ ṽmax and λ 2 " G ´3{2 λ 1 , (S10) gives the following inequality: Using the restricted strong convexity presented in Condition 2, we obtain where the last inequality uses }Aδ} 2 ď 2}δ} 2 and 2ab ď a 2 `b2 for any numbers a and b.This gives Combining with the definitions of λ 1 and λ 2 , this completes the proof.

Comparing the objective values at Bora
This completes the proof.

S3 Technical conditions
In this section, we provide the technical conditions required to prove Theorems 1-2.Specifically, Subsection S3.1 discusses the technical conditions required to establish Theorems 1-2, while Subsection S3.2 introduces an additional technical condition that is crucial to prove Theorems 3-5.
the following restricted strong convexity holds: F ą c for some c ą 0.
Condition 3.For j P S c and 1 ď g ď G, it holds that sup δPR s s Condition 1 assumes uniform boundness and minimum and maximum eigenvalue condition on the true support set of X pgq , which is commonly assumed in the literature [4,5,6,7].Condition 2 means that X should satisfy the Restricted Strong Convexity (RSC) condition over a restricted set A. Note that Condition 2 is different from the restricted strong convexity condition in existing high-dimensional regression models that use decomposable regularizers [8].Our regularizer is the sum of a group Lasso type and a fused Lasso type, which is not decomposable.Condition 3 is also imposed in Zheng et al. [5] with a slight modification to prove the oracle property of their estimator, which restricts the correlations between relevant covariates and irrelevant covariates.Condition 4 constrains the minimum signal strength, which is commonly assumed in the literature for variable selection [5,7,9,10].

S3.2 An additional technical condition for Theorems 3-5
We impose an additional assumption to analyze theoretically our debiased estimator bpgq .
Condition 5.There exist an absolute constant 0 ă C 2 ď 1 such that Condition 5 imposes constraints on the sparsity parameter s, the number of groups G, the number of covariates p, the sample sizes n g 's, and the norm of M pgq .In particular, the condition max 1ďgďG ~M pgq ~1 ď s 0 is crucial to show the theoretical properties of our debiased estimator.The similar condition is also imposed in Javanmard and Montanari [11] in the linear regression setting.

S4 Additional numerical results
In this section we present additional numerical results.Subsection S4.1 reports the results of sensitivity analyses of the constraint parameter in quadratic programming.Subsection S4.2 compares the proposed method with DL and DL-E, which are Lassobased bias correction methods, in terms of bias, standard error, and coverage probability for individual coefficients.Subsection S4.3 provides additional simulation results using the simulation model introduced in Section 4.1, focusing specifically on a case where the sample sizes for the groups differ.Subsection S4.4 provides the results of a sensitivity analysis of the results of the CCLE data analysis using DFGL based on parametric bootstrapping.

S4.2 Bias, standard error, and coverage probability
In this subsection, we examine the bias, standard error, and coverage probability of the proposed method and Lasso-based methods when considering individual coefficients.Tables S3 and S4 summarize the bias, standard error, and coverage probability, respectively, under the simulation models described in Section 4.1.DL, a debiased Lasso based on node-wise regression, generally produces large biases compared to other methods.DL-E, the approach based on the exact inverse of the information matrix, provides large standard errors, so it has relatively small biases and desired coverage probabilities compared to other methods.Although the proposed DFGL has relatively smaller standard errors compared to DL-E, the two methods provide similar biases when the covariates of interest have relatively weak signals.Therefore, the proposed DFGL has more power than DL and DL-E when testing the overall significance of the coefficients for a covariate with weak signals.Furthermore, although DL-E has large standard errors, DL-E fails to control for Type I errors when testing the homogeneity of the coefficients for a covariate with strong signals.In contrast to DL-E, the proposed DFGL gives type I errors close to or less than the nominal level in these cases.

S4.3 An additional simulation study for unbalanced designs
We perform an additional simulation to evaluate the performance of DFGL under the simulation models in Section 4.1 by considering different sample sizes across groups.We consider model parameters as follows: p " 120, G " 7, n 1 " . . ." n 4 " 200, but n 5 " n 6 " n 7 " 300.

S4.3.1 Testing homogeneity and significance
Tests of homogeneity and significance were performed at α " 0.05, as described in Section 4.2.Table S7 summarizes the powers for testing homogeneity and significance, while Table S8 summarizes the type I errors.Overall, the DFGL method has higher powers compared to the other methods, while keeping the type I errors close to the nominal level, except in the cases of testing the homogeneity of the effects of the covariate with the strongest signal.DL, which has low power compared to DFGL, has higher type I errors compared to DFGL in these cases.Although the four methods, DL-B, DL-E, DL-E-B, and DR-B, have type I errors below the nominal level, these methods are conservative.
H 0,j : H 0,j : In this analysis, methods with Bonferroni correction, i.e.DL-B, DL-E-B and DR-B, are excluded.This is because they are conservative, as observed in Table S7.Table S9 summarizes the results when p-values are adjusted by the BH procedure to control the familywise error rate.Compared to the results when n 1 " . . ." n G " 300, the power tends to decrease for all methods.However, the DFGL still has the highest power in all cases, while providing FWER below the nominal level α " 0.05.Similar to the results when n 1 " . . ." n G " 300, DL fails to control FWER when considering the homogeneity test (S14) and DL-E shows low power compared to the two methods, DFGL and DL.

S4.4 Sensitiviy analysis of the results of CCLE data analysis
To investigate the sensitivity of the results of CCLE data analysis using DFGL, we consider parametric bootstrapping.For each drug, we generate response variables from the logistic regression model with coefficients as the FGL estimate obtained from the CCLE data, while conditioning the design matrix.We simulate a total of B " 300 bootstrap samples.
For gene-drug pairs previously identified by applying our significance test to the CCLE data, Figure S1 shows the frequency of identification of each of these pairs at the 0.05 significance level over B " 300 replicates.We observe that some gene-drug pairs are identified relatively consistently.For example, SLFN11 is identified as a significant gene for Irinotecan and Topotecan in over 45% of the total 300 replicates, and NQO1 is identified as a significant gene for 17-AAG in about 40% of the total 300 replicates.However, some of the gene-drug pairs identified using whole cell lines in Section 5.2 are not identified in most of the bootstrap samples.This may be because we allow for type I errors by setting the significance level to α " 0.05; thus, DFGL may identify relatively less significant gene-drug pairs.For the gene-drug pairs that were not identified by applying our significance test to the CCLE data, each gene is identified on average in approximately 6%, 7%, 10%, 6%, and 6% of the B " 300 bootstrap samples for 17-AAG, AZD6244, Irinotecan, PD-0325901, and topotecan, respectively.These results suggest that, on average, these gene-drug pairs may be less significant than relatively frequently identified gene-drug pairs, including SLFN11-Topotecan SLFN11-Irinotecan, NQO1-17AAG, SIRPA-AZD6244, and ETV4-PD-0325901.The related results to the significance for SLFN11-Topotecan, SLFN11-Irinotecan, NQO1-17AAG, SIRPA-AZD6244, and ETV4-PD-0325901 were also observed in previous literature [12,13,14,15].
When considering the homogeneity test, the gene SLFN11 is identified relatively frequently for Topotecan compared to other gene-drug pairs, as shown in Figure S2.The application of a penalized mixture regression to the CCLE data [13]

S5 Additional tables
In this section, we include additional tables.

Condition 4 .
The true coefficient matrix B satisfies min jPS }β pjq } 2 ^min pj,g,g 1 qPΩ 2, we prove that B is equivalent to the oracle estimator Bora S defined at (S11) with probability tending to one.Let ϵ :" expp´2C Cq{ ´1 `expp´2C Cq ¯2 and recall that C is defined in Condition1 and C satisfies max 1ďgďG }β pgq } 1 ď C. Lemma 2. Assume that the conditions of Theorems 1 and 2 hold.Then with probability at least 1 ´1{ppGq, } B ´B} 2 F ď Bora S,¨q s g s j ˇˇ¨}δ pgq } 1 1, we have for : β :" matp : jPS ˇˇrr∇ Ln p also suggests the heterogeneous effects of SLFN11 on Topotecan across clusters.

Table S3 :
Bias, standard error, and coverage probability of the proposed method and Lasso-based methods when the covariance structure for the covariates is AR(1).S.E and C.P respectively represent standard error and coverage probability.We consider j " 6, 7, 11, 15, 20 to measure performances for zero coefficients.Here, the numbers in parentheses represent standard deviations.

Table S4 :
Bias, standard error, and coverage probability of the proposed method and Lasso-based methods when the covariance structure for the covariates is the block diagonal.S.E and C.P respectively represent standard error and coverage probability.We consider j " 6, 7, 11, 15, 20 to measure performances for zero coefficients.Here, the numbers in parentheses represent standard deviations.